Multiple regression

Gabriel Singer

23.02.2022

Multiple linear regression: Model

Multiple metric continuous independent predictors are used to predict one metric response variable.

Already met 2-way ANOVA and ANCOVA as models with >1 predictor.


Model formulation as a linear combination of k predictors: \[ Y=\beta_0+\beta_1X_1+\beta_2X_2+...+\beta_kX_k \] \(\beta_0\) is the intercept. \(\beta_1...\beta_k\) are partial regression coefficients for \(X_1...X_k\), each expressing the change in \(Y\) per unit change in a particular \(X_j\) while holding all other \(X\)-variables constant (for instance, the effect of \(X_1\) adjusted for any effects of \(X_2...X_k\)).

\(\beta_0...\beta_k\) are estimated by \(b_0...b_k\) from the sample.

Using matrix notation:

\[ \hat{y} = \mathbf{X}\mathbf{B} \]

\(\mathbf{B}\) is the parameter vector

\(\mathbf{X}\) is the design matrix

\[ \mathbf{X} = \begin{bmatrix} 1 & x_{1,1} & x_{2,1} & \dots & x_{k,1} \\ 1 & x_{1,2} & x_{2,2} & \dots & x_{k,2} \\ \vdots & \vdots & \vdots & \dots & \vdots \\ 1 & x_{1,n} & x_{2,n} & \dots & x_{k,n} \\ \end{bmatrix} \]


\[ \begin{aligned} \hat{y} & = \mathbf{X}\mathbf{B} \\ y &= \mathbf{X}\mathbf{B} + \epsilon \\ \epsilon & \sim \mathcal{N}(0, s_\epsilon) \end{aligned} \]

The equation is a linear system and can be solved with linear algebra by OLS, minimizing the sum of squared residuals:

\[ \mathrm{min}: \sum \epsilon^2 = \sum \left (y - \mathbf{X}\mathbf{B} \right)^2 \]
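Because the model is linear in \(\mathbf{B}\), the OLS minimizer has the closed form \(\mathbf{B}=(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top y\). A minimal sketch in R with simulated data (all variable names are illustrative):

```r
# simulate a small dataset with two predictors (illustrative)
set.seed(1)
n <- 50
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 2 + 1.5*x1 - 0.8*x2 + rnorm(n, sd = 0.5)

# design matrix with an intercept column, as in the formula above
X <- cbind(1, x1, x2)

# OLS via the normal equations: solve (X'X) B = X'y
B <- solve(t(X) %*% X, t(X) %*% y)

# agrees (up to numerical precision) with lm()
coef(lm(y ~ x1 + x2))
drop(B)
```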

Multiple linear regression: Hypotheses

  1. \(\mathbf{H_{0,regression}}\): All partial regression coefficients are zero. The model (i.e., the linear system \(\mathbf{XB}\)) does not explain any of the variation in \(y\).

Tested by ANOVA (as in simple LR): \(SS_{reg}\) and \(SS_{res}\) converted to \(MS_{reg}\) and \(MS_{res}\), then F-ratio of explained to residual variance to get P.

  2. \(\mathbf{H_{0,coef_j}}\): An individual partial regression coefficient, the slope of the relationship between \(y\) and \(x_j\), is zero. k predictors give rise to k such hypotheses.

Tested with a \(t\)-statistic (in the output of lm) or by forming CIs for each partial regression coefficient and checking their overlap with 0. Alternatively: use nested models, comparing the fit of a full and a reduced model (without the predictor in question) and checking for a significant change of \(SS_{reg}\) with a partial F-test.
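The partial F-test is run in R by handing the nested models to anova(); a sketch with simulated data (names are illustrative). For a single dropped term, the partial F equals the square of the t-statistic from the full model:

```r
set.seed(2)
n <- 40
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 0.9*x1 + rnorm(n)      # x2 has no true effect

full    <- lm(y ~ x1 + x2)
reduced <- lm(y ~ x1)            # model without the predictor in question

# partial F-test: does dropping x2 significantly change SS_reg?
anova(reduced, full)

# equivalent information: the t-test for x2 in summary(full); F = t^2
```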


Explained variance: multiple \(r^2\) (analogous to simple regression) = “the percentage of variation of Y explained by all X variables.”

BUT: More predictors (even if random) always produce a higher \(r^2\); therefore use the adjusted \(r^2\), which accounts for the number of predictors.

\[ {r^2}_{adj}=1-\frac{SS_{res}}{SS_{tot}}\cdot\frac{n-1}{n-k-1} \] The idea is to balance the increment in \(r^2\) with the additional complexity of the model. The cost of adding an additional predictor must be balanced by an actual gain in \(r^2\).
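As a check, the adjusted \(r^2\) computed from this formula matches the value summary(lm) reports; a sketch with simulated data (illustrative names):

```r
set.seed(3)
n <- 30; k <- 2
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + x1 + 0.5*x2 + rnorm(n)
m  <- lm(y ~ x1 + x2)

SS_res <- sum(resid(m)^2)
SS_tot <- sum((y - mean(y))^2)

r2     <- 1 - SS_res/SS_tot
r2_adj <- 1 - (SS_res/SS_tot) * (n - 1)/(n - k - 1)

c(r2,     summary(m)$r.squared)      # should agree
c(r2_adj, summary(m)$adj.r.squared)  # should agree
```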

Which predictors are important?

  1. Consider dropping non-significant predictors
    • unless you know/strongly suspect they are important
    • \(x_1\) might be significant, but only when \(x_2\) is in the model!
    • interactions (more on this later) must always be accompanied by their main effects (\(b_1x_1 + b_2x_2 + b_3x_1x_2\))
    • higher-order polynomial terms should always be accompanied by the lower-order terms (\(b_1x+b_2x^2\))
  2. Compare slopes in a scale-independent way
    • standardize your predictors (especially if \(s_{x_1} >> s_{x_2}\)), or
    • standardize the coefficients

\[ {b_j}^*=b_j\frac{s_{X_j}}{s_Y} \]

Higher \({b_j}^*\) means stronger influence of \(X_j\). Note that in software output \({b_j}^*\) is often referred to as \(\beta_j\).
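Standardized coefficients can be obtained either by rescaling each \(b_j\) as above or by refitting the model on z-scored variables; both routes give the same \({b_j}^*\). A sketch with simulated data (illustrative names), where \(x_1\) lives on a much larger scale than \(x_2\):

```r
set.seed(4)
n <- 50
x1 <- rnorm(n, sd = 10)   # predictor on a much larger scale
x2 <- rnorm(n, sd = 1)
y  <- 2 + 0.1*x1 + 1*x2 + rnorm(n)

m <- lm(y ~ x1 + x2)

# standardize the coefficients: b* = b * s_x / s_y
b_star <- coef(m)[-1] * c(sd(x1), sd(x2)) / sd(y)

# same result from a model on z-scored variables
m_z <- lm(scale(y) ~ scale(x1) + scale(x2))
coef(m_z)[-1]
b_star
```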

Model building

Classical tools to search for important predictors to include: forward and backward stepwise modelling with addition or dropping of predictors based on F or P-values (drop1, add1 or step in R).


Model building: The extensive search to identify important predictors leads away from strict hypothesis testing. Prediction quality counts more than significance testing, so even insignificant predictors may be worth keeping, and extensive model building may take place.


Overfitting: In an overfitted model a predictor contributes to an increase of \(r^2\) only through random noise in the specific dataset used to build the model. Such inclusion of irrelevant predictors inflates apparent fit but harms prediction on new data.

Ways out:

  1. Use \({r^2}_{adj}\) when judging model fit and comparing models: Large differences between \(r^2\) and \({r^2}_{adj}\) point to an overfitted model.

  2. Use information criteria to compare models: The Akaike information criterion (AIC) aims to identify the most parsimonious model, i.e., the model achieving the best fit with the lowest complexity (= number of predictors). Similarly to \({r^2}_{adj}\), it is computed by “punishing” a measure of model fit with a term involving the number of predictors k:

\[ AIC=n\cdot{\ln{SS_{res}}}+2(k+1)-n\cdot{\ln{n}} \] We aim for models with low AIC. Models with a \(\Delta{AIC}<2\) are considered equivalent.
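For lm objects this is exactly the quantity reported by extractAIC() (the scale used by step(); it differs from AIC() by an additive constant). A quick check with simulated data (illustrative names):

```r
set.seed(5)
n <- 30; k <- 2
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + x1 + 0.5*x2 + rnorm(n)
m  <- lm(y ~ x1 + x2)

SS_res <- sum(resid(m)^2)

# AIC = n*ln(SS_res) + 2(k+1) - n*ln(n)
aic_hand <- n*log(SS_res) + 2*(k + 1) - n*log(n)

c(aic_hand, extractAIC(m)[2])   # should agree
```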


Validation techniques:

Pitfalls: Model flexibility

The curse of dimensionality


Assume predictors \(X_1\) and \(X_2\) and limited sampling effort of n=16:

  1. When studying only one predictor, we can cover its entire range of interest well.
  2. When studying two predictors with the same effort, our samples are dispersed in 2D-space. We can't reach the same density, but we can still cover the entire 2D-space in a regular grid.
  3. The more likely reality produces a dispersed distribution over the 2D-space with well-covered and poorly covered areas. The data become sparse. Maintaining a high sampling density becomes increasingly difficult when more than two dimensions are involved. We no longer cover our predictors well enough!


Additional assumptions

Highly correlated predictor variables cause model instability: large CIs for the regression slopes and uncertain importance of individual predictors (although the overall model may still predict well).

Assessing collinearity by VIF: Ignoring \(y\) for a moment, we can perform regressions of the \(x\) variables against each other:

\[ x_i = b_0 + b_1x_1 + \dots + b_kx_k + \epsilon \qquad \text{(excluding } x_i\text{)} \]

Large \(R^2_i\) would argue for redundancy of \(x_i\) (its information is already contained in a linear combination of all other \(x\)-variables). \(VIF_i\) is a transformation of \(R^2_i\): \[ \mathrm{VIF}_i = \frac{1}{1-R^2_i} \]


The VIF (variance inflation factor) tells you by how much the SE of a regression coefficient is inflated relative to the case of uncorrelated predictors (\(\rho=0\)):

\[ s_b = s_{b,\rho=0}\sqrt{\mathrm{VIF}} \]

Get VIF with car::vif(full_model) for all predictors of a particular model.
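As a cross-check, VIFs can also be computed by hand from the \(R^2\) of the auxiliary regressions above; a sketch with simulated predictors (illustrative names, vif_hand is a hypothetical helper):

```r
set.seed(6)
n <- 100
x1 <- rnorm(n)
x2 <- 0.9*x1 + sqrt(1 - 0.9^2)*rnorm(n)  # strongly correlated with x1
x3 <- rnorm(n)                           # independent of the others

# VIF_i = 1/(1 - R^2_i), from regressing x_i on all other predictors
vif_hand <- function(i, X) {
  r2 <- summary(lm(X[, i] ~ X[, -i]))$r.squared
  1/(1 - r2)
}
X <- cbind(x1, x2, x3)
sapply(1:3, vif_hand, X = X)
# x1 and x2 show inflated VIFs; x3 stays near 1
```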

MLR: Results presentation

Geometrical representation / graphing:

  1. A plane in 3D-space for two predictors; higher-dimensional surfaces would be needed (and are not printable) for >2 predictors.
  2. Two predictors defining a plane, with the response shown as a colour gradient in the dots or by contour lines.
  3. Often observed vs. predicted values (which should fall on a 1:1 line) are graphed to at least provide a graphical assessment of model quality.
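An observed-vs-predicted plot takes two lines from any fitted model; a sketch with simulated data (illustrative names), with the red 1:1 line marking perfect prediction:

```r
set.seed(7)
n <- 40
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + x1 + 0.5*x2 + rnorm(n, sd = 0.3)
m  <- lm(y ~ x1 + x2)

pred <- fitted(m)
plot(pred, y, xlab = "Predicted", ylab = "Observed")
abline(0, 1, col = "red")   # 1:1 line
```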

MLR in R

# data from limnology: SPM in lakes
# required: MLR to predict SPM from other variables
data<-read.table(file="data/LakeSPM.txt",header=TRUE)[,-c(1:2)] 
# remove first two variables (just lakename and # in dataset)

plot(data)   # correlations? collinearity?

boxplot(scale(data)) # distributions?

data[,-c(5,10)]<-log(data[,-c(5,10)]) # transform all except pH and Vd

# dropping colinear variables using VIF
mlr.full<-lm(spm~.,data=data) # makes a model with all predictors
library(car)
## Loading required package: carData
vif(mlr.full)
##         area           Dm         Dmax           pH           TP            T 
## 1.259843e+06 9.310628e+05 3.947461e+02 2.337318e+00 3.883131e+00 3.576987e+02 
##            Q           DR           Vd 
## 5.275149e+02 1.034419e+06 4.269601e+01
# check high VIF for area
summary(lm(area~.-spm,data=data)) # -> r2 is 1!
## 
## Call:
## lm(formula = area ~ . - spm, data = data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0053403 -0.0008962  0.0003460  0.0008821  0.0040917 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.894e-04  5.712e-02  -0.010    0.992    
## Dm           2.005e+00  1.526e-02 131.403   <2e-16 ***
## Dmax        -2.862e-03  9.258e-03  -0.309    0.761    
## pH           1.243e-04  1.002e-03   0.124    0.903    
## TP           7.409e-05  1.685e-03   0.044    0.965    
## T           -2.094e-04  3.995e-03  -0.052    0.959    
## Q           -1.008e-04  3.802e-03  -0.027    0.979    
## DR           1.999e+00  7.636e-03 261.836   <2e-16 ***
## Vd           9.416e-04  8.395e-03   0.112    0.912    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.002301 on 17 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 2.677e+06 on 8 and 17 DF,  p-value: < 2.2e-16
# continue dropping variables until VIF <5-10
del<-which(names(data) %in% c("area","Dm","Dmax"))
data<-data[,-del]
mlr.full<-lm(spm~.,data=data) # refit after variable deletion

# start with simple regression model 
simple.mod1<-lm(spm~TP,data=data)
summary(simple.mod1)
## 
## Call:
## lm(formula = spm ~ TP, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.90449 -0.28309 -0.08777  0.17917  1.20813 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -3.8724     0.5793  -6.685 6.48e-07 ***
## TP            1.5517     0.1891   8.206 2.00e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5089 on 24 degrees of freedom
## Multiple R-squared:  0.7373, Adjusted R-squared:  0.7263 
## F-statistic: 67.35 on 1 and 24 DF,  p-value: 2.004e-08
plot(spm~TP,data=data)
abline(simple.mod1,col="red")


# forward stepwise MLR building using F
# start with TP and see if other terms can be added to the model
names(data)
## [1] "spm" "pH"  "TP"  "T"   "Q"   "DR"  "Vd"
add1(simple.mod1,scope=~pH+TP+T+Q+DR+Vd,test="F") 
## Single term additions
## 
## Model:
## spm ~ TP
##        Df Sum of Sq    RSS     AIC F value   Pr(>F)   
## <none>              6.2163 -33.204                    
## pH      1   2.02398 4.1923 -41.446 11.1041 0.002896 **
## T       1   0.41143 5.8048 -32.985  1.6302 0.214421   
## Q       1   0.00925 6.2070 -31.243  0.0343 0.854777   
## DR      1   1.96782 4.2484 -41.100 10.6533 0.003414 **
## Vd      1   0.79150 5.4248 -34.745  3.3558 0.079950 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# adds single terms and compares models with an F-test

mlr1<-lm(spm~TP+pH,data=data)
summary(mlr1)
## 
## Call:
## lm(formula = spm ~ TP + pH, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.86284 -0.24187 -0.04459  0.18318  0.91888 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -6.5089     0.9285  -7.010 3.83e-07 ***
## TP            1.4664     0.1607   9.127 4.16e-09 ***
## pH            0.4105     0.1232   3.332   0.0029 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4269 on 23 degrees of freedom
## Multiple R-squared:  0.8228, Adjusted R-squared:  0.8074 
## F-statistic:  53.4 on 2 and 23 DF,  p-value: 2.275e-09
# -> good model, try adding more terms

# backward stepwise MLR building using F
# start with full model (all terms) and see if terms can be dropped
summary(mlr.full)
## 
## Call:
## lm(formula = spm ~ ., data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.92577 -0.11178  0.02268  0.20035  0.76549 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -3.60757    1.29121  -2.794  0.01158 * 
## pH           0.25790    0.16681   1.546  0.13860   
## TP           1.02714    0.26827   3.829  0.00113 **
## T           -0.04065    0.05496  -0.740  0.46863   
## Q           -0.02767    0.04319  -0.641  0.52947   
## DR           0.36796    0.12515   2.940  0.00840 **
## Vd           0.31319    0.32422   0.966  0.34619   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3871 on 19 degrees of freedom
## Multiple R-squared:  0.8797, Adjusted R-squared:  0.8417 
## F-statistic: 23.15 on 6 and 19 DF,  p-value: 8.788e-08
drop1(mlr.full,test="F")
## Single term deletions
## 
## Model:
## spm ~ pH + TP + T + Q + DR + Vd
##        Df Sum of Sq    RSS     AIC F value   Pr(>F)   
## <none>              2.8466 -43.511                    
## pH      1   0.35810 3.2047 -42.430  2.3901 0.138597   
## TP      1   2.19631 5.0429 -30.643 14.6594 0.001133 **
## T       1   0.08194 2.9286 -44.773  0.5469 0.468633   
## Q       1   0.06147 2.9081 -44.955  0.4103 0.529466   
## DR      1   1.29509 4.1417 -35.762  8.6441 0.008404 **
## Vd      1   0.13980 2.9864 -44.264  0.9331 0.346193   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# -> drop Q (lowest F, highest P, also makes lowest AIC)
mlr2<-lm(spm~pH+TP+T+DR+Vd,data=data)
summary(mlr2)
## 
## Call:
## lm(formula = spm ~ pH + TP + T + DR + Vd, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.88293 -0.20247  0.00856  0.23363  0.81997 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -3.80213    1.23633  -3.075  0.00597 **
## pH           0.21254    0.14880   1.428  0.16861   
## TP           0.98497    0.25620   3.844  0.00101 **
## T           -0.03584    0.05364  -0.668  0.51164   
## DR           0.35463    0.12158   2.917  0.00853 **
## Vd           0.42131    0.27270   1.545  0.13804   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3813 on 20 degrees of freedom
## Multiple R-squared:  0.8771, Adjusted R-squared:  0.8464 
## F-statistic: 28.54 on 5 and 20 DF,  p-value: 1.869e-08
# continue until no more dropping...

# automatic stepwise search based on AIC
step(mlr.full,direction="backward")
## Start:  AIC=-43.51
## spm ~ pH + TP + T + Q + DR + Vd
## 
##        Df Sum of Sq    RSS     AIC
## - Q     1   0.06147 2.9081 -44.955
## - T     1   0.08194 2.9286 -44.773
## - Vd    1   0.13980 2.9864 -44.264
## <none>              2.8466 -43.511
## - pH    1   0.35810 3.2047 -42.430
## - DR    1   1.29509 4.1417 -35.762
## - TP    1   2.19631 5.0429 -30.643
## 
## Step:  AIC=-44.96
## spm ~ pH + TP + T + DR + Vd
## 
##        Df Sum of Sq    RSS     AIC
## - T     1   0.06492 2.9730 -46.381
## <none>              2.9081 -44.955
## - pH    1   0.29667 3.2048 -44.430
## - Vd    1   0.34706 3.2552 -44.024
## - DR    1   1.23717 4.1453 -37.739
## - TP    1   2.14909 5.0572 -32.569
## 
## Step:  AIC=-46.38
## spm ~ pH + TP + DR + Vd
## 
##        Df Sum of Sq    RSS     AIC
## <none>              2.9730 -46.381
## - pH    1    0.2469 3.2200 -46.307
## - Vd    1    0.2913 3.2643 -45.951
## - DR    1    1.1823 4.1554 -39.676
## - TP    1    4.5877 7.5608 -24.113
## 
## Call:
## lm(formula = spm ~ pH + TP + DR + Vd, data = data)
## 
## Coefficients:
## (Intercept)           pH           TP           DR           Vd  
##     -3.8703       0.1878       1.0959       0.3433       0.3709
step(simple.mod1,scope=~pH+TP+T+Q+DR+Vd,direction="forward")
## Start:  AIC=-33.2
## spm ~ TP
## 
##        Df Sum of Sq    RSS     AIC
## + pH    1   2.02398 4.1923 -41.446
## + DR    1   1.96782 4.2484 -41.100
## + Vd    1   0.79150 5.4248 -34.745
## <none>              6.2163 -33.204
## + T     1   0.41143 5.8048 -32.985
## + Q     1   0.00925 6.2070 -31.243
## 
## Step:  AIC=-41.45
## spm ~ TP + pH
## 
##        Df Sum of Sq    RSS     AIC
## + DR    1   0.92796 3.2643 -45.951
## <none>              4.1923 -41.446
## + Vd    1   0.03692 4.1554 -39.676
## + Q     1   0.02363 4.1686 -39.593
## + T     1   0.00252 4.1898 -39.462
## 
## Step:  AIC=-45.95
## spm ~ TP + pH + DR
## 
##        Df Sum of Sq    RSS     AIC
## + Vd    1  0.291287 2.9730 -46.381
## <none>              3.2643 -45.951
## + Q     1  0.225708 3.0386 -45.814
## + T     1  0.009151 3.2552 -44.024
## 
## Step:  AIC=-46.38
## spm ~ TP + pH + DR + Vd
## 
##        Df Sum of Sq    RSS     AIC
## <none>              2.9730 -46.381
## + T     1  0.064924 2.9081 -44.955
## + Q     1  0.044461 2.9286 -44.773
## 
## Call:
## lm(formula = spm ~ TP + pH + DR + Vd, data = data)
## 
## Coefficients:
## (Intercept)           TP           pH           DR           Vd  
##     -3.8703       1.0959       0.1878       0.3433       0.3709
# same solutions! compare with selection by F!

# combined forward and backward selection based on AIC
step(mlr.full,direction="both")
## Start:  AIC=-43.51
## spm ~ pH + TP + T + Q + DR + Vd
## 
##        Df Sum of Sq    RSS     AIC
## - Q     1   0.06147 2.9081 -44.955
## - T     1   0.08194 2.9286 -44.773
## - Vd    1   0.13980 2.9864 -44.264
## <none>              2.8466 -43.511
## - pH    1   0.35810 3.2047 -42.430
## - DR    1   1.29509 4.1417 -35.762
## - TP    1   2.19631 5.0429 -30.643
## 
## Step:  AIC=-44.96
## spm ~ pH + TP + T + DR + Vd
## 
##        Df Sum of Sq    RSS     AIC
## - T     1   0.06492 2.9730 -46.381
## <none>              2.9081 -44.955
## - pH    1   0.29667 3.2048 -44.430
## - Vd    1   0.34706 3.2552 -44.024
## + Q     1   0.06147 2.8466 -43.511
## - DR    1   1.23717 4.1453 -37.739
## - TP    1   2.14909 5.0572 -32.569
## 
## Step:  AIC=-46.38
## spm ~ pH + TP + DR + Vd
## 
##        Df Sum of Sq    RSS     AIC
## <none>              2.9730 -46.381
## - pH    1    0.2469 3.2200 -46.307
## - Vd    1    0.2913 3.2643 -45.951
## + T     1    0.0649 2.9081 -44.955
## + Q     1    0.0445 2.9286 -44.773
## - DR    1    1.1823 4.1554 -39.676
## - TP    1    4.5877 7.5608 -24.113
## 
## Call:
## lm(formula = spm ~ pH + TP + DR + Vd, data = data)
## 
## Coefficients:
## (Intercept)           pH           TP           DR           Vd  
##     -3.8703       0.1878       1.0959       0.3433       0.3709
mlr3<-lm(spm~pH+TP+DR+Vd,data=data)
AIC(mlr3) # full-likelihood AIC
## [1] 29.40346
extractAIC(mlr3) # edf and the AIC scale used by step()
## [1]   5.00000 -46.38135
# the two scales differ only by an additive constant:
# only AIC differences between models (on the same scale) are meaningful